Efficient Algorithm for Removing Duplicate Documents

Authors

  • Suresh Subramanian
  • Sivaprakasam
Abstract

The Internet contains a vast amount of information: HTML documents, Word and PDF files, audio and video files, images, and so on. Researchers face huge challenges in providing users with the documents that are required by and relevant to their queries, and identifying duplicate and near-duplicate web documents adds further overhead. This paper addresses these issues with a Genetic Algorithm and a Duplicate Web Documents Identification Function, which improve the relevance of retrieved documents by removing duplicate records from the dataset.
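The abstract does not detail the Duplicate Web Documents Identification Function itself, so the following is only an illustrative baseline for the duplicate-removal step it describes, using word shingling and Jaccard similarity; the function names and the 0.8 threshold are assumptions, not taken from the paper:

```python
def shingles(text, k=3):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def remove_near_duplicates(docs, threshold=0.8):
    """Keep each document unless it is a near duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

A genetic algorithm, as in the paper, would instead search over parameters such as the similarity threshold or feature weights; the greedy filter above only sketches the final deduplication pass.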


Similar resources

A Near-duplicate Detection Algorithm to Facilitate Document Clustering

Web mining faces huge problems due to duplicate and near-duplicate web pages. Detecting near duplicates is very difficult in a large collection of data such as the Internet. The presence of these pages plays an important role in performance degradation when integrating data from heterogeneous sources, and they increase either the index storage space or the serving costs. Detecting t...

Re-ranking of Images and Removing Duplicate Images

Image re-ranking is one of the most efficient ways to improve image search results and has been adopted by many search engines. Texture analysis is a popular operation in CBIR. In this paper we combine text-based search with CBIR, and duplicate images are removed using a pixel-matching algorithm. Text-based search is done using tags. CBIR is done using texture fe...
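The snippet above does not specify the pixel-matching algorithm, so the following is only a minimal sketch of the exact-match variant: fingerprint each image's raw pixel buffer and keep the first image per fingerprint. Images are assumed here to be flat lists of (R, G, B) tuples; the helper names are illustrative:

```python
import hashlib

def pixel_fingerprint(pixels):
    """Hash a flat sequence of (R, G, B) tuples into a stable fingerprint."""
    data = bytes(v for px in pixels for v in px)  # flatten channel values
    return hashlib.sha256(data).hexdigest()

def drop_duplicate_images(images):
    """Keep only the first image seen with each pixel fingerprint."""
    seen, unique = set(), []
    for img in images:
        fp = pixel_fingerprint(img)
        if fp not in seen:
            seen.add(fp)
            unique.append(img)
    return unique
```

An exact pixel hash only catches byte-identical images; detecting resized or re-encoded copies would need a perceptual comparison instead.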

Near Duplicate Document Detection for Large Information Flows

Near-duplicate documents and their detection are studied to identify information items that convey the same (or very similar) content, possibly surrounded by diverse sets of side information such as metadata, advertisements, timestamps, web presentation, and navigation supports. Identifying near-duplicate information allows the implementation of selection policies aiming to optimize an i...
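For large information flows, a common constant-space technique (not necessarily the one used in the paper above) is SimHash: each document is reduced to a short fingerprint, and near duplicates differ in only a few bits. A minimal sketch, with the 64-bit size and the distance cutoff of 3 chosen only for illustration:

```python
import hashlib

def simhash(text, bits=64):
    """SimHash fingerprint: sum signed bit votes from each token's hash."""
    vector = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def near_duplicate(doc_a, doc_b, max_distance=3):
    """Treat documents as near duplicates if their fingerprints are close."""
    return hamming(simhash(doc_a), simhash(doc_b)) <= max_distance
```

Because fingerprints are small fixed-size integers, they can be stored and compared in a stream without keeping the original documents, which is what makes the approach attractive for large flows.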

An Efficient Approach for Near-duplicate page detection in web crawling

The drastic development of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amount of web documents swarming the web has posed huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents has created additional overhea...

Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling

Recent years have witnessed the drastic development of the World Wide Web (WWW). Information is accessible at one's fingertips, anytime and anywhere, through the massive web repository. The performance and reliability of web search engines thus face huge problems due to the enormous amount of web data. The voluminous amount of web documents has resulted in problems for search engines, leading to ...


Journal title:

Volume   Issue

Pages  -

Publication date: 2014